Abstract

We study the network structure of Wikipedia (restricted to its math- 
ematical portion), MathWorld, and DLMF. We approach these three on- 
line mathematical libraries from the perspective of several global and local 
network-theoretic features, providing for each one the appropriate value 
or distribution, along with comparisons that, if possible, also include the 
whole of the Wikipedia or the Web. We identify some distinguishing char- 
acteristics of all three libraries, most of them supposedly traceable to the 
libraries' shared nature of relating to a very specialized domain. Among 
these characteristics are the presence of a very large strongly connected 
component in each of the corresponding directed graphs, the complete ab- 
sence of any clear power laws describing the distribution of local features, 
and the rise to prominence of some local features (e.g., stress centrality) 
that can be used to effectively search for keywords in the libraries. 

Keywords: Online mathematical libraries, Wikipedia, MathWorld, DLMF, 
complex networks, text search. 
1 Introduction



Until a few decades ago, before it became commonplace to search the Web for
information and knowledge, people desiring quick access to some mathematical
concept or formula used to resort to printed encyclopedias or handbooks, such as
the compilation by Abramowitz and Stegun [1], the more specialized tables put
together by Gradshteyn and Ryzhik [26], or still others [551 [TS]. Of such
volumes, the undisputed citations champion seems to be Abramowitz and Stegun's
[5], whose work has since been methodically expanded [32] into a NIST-sponsored
publication [41].

Lately, though, the situation has become, if anything, more complex. For,
while those printed works continue to be used and cited widely and their ranks
continue to be enlarged by the addition of new works of a similar genre [25],
the premier source, at least for a first approach, has undoubtedly become the
Web. In fact, it seems safe to state that most mathematics-related queries
on Google return Wikipedia^1 or Wolfram MathWorld^2 pages as prominently
ranked. As mentioned, however, printed and online material still coexist and,
curiously, movement has taken place in both directions: while in one direction
MathWorld material has found its way into Weisstein's encyclopedia [47], in the
other the NIST volume has been turned into the Digital Library of Mathematical
Functions, DLMF^3.

Here we aim to explore the structure of mathematical knowledge as reflected 
in these three online libraries. By "structure" we do not mean the organization 
of material into the many mathematical areas and subareas. Nor do we mean 
the coalescence of all deduction chains that is behind all of mathematics and 
inherently amounts to an acyclic directed graph [17], i.e., one with no directed
cycles. We mean, rather, the no longer acyclic directed graphs that reflect all 
the cross-referencing that took place as those libraries were created by several 
collaborators (and still takes place as the libraries evolve). Exploring their graph 
structures from the perspective of such hypertextual interconnections amounts 
to applying some of the complex-network notions and metrics developed during 
the past fifteen years or so, much as has been done so successfully to various 
other fields [HllillllO]. 

It also amounts to a chance to globally view all the material compiled into 
each library and inquire, from a network-theoretic perspective, what traces re- 
main, if any, as telltale signs of the essentially very distinct methods of con- 
struction employed to build them, all of a collaborative nature but supposedly 
more and more controlled as we move from Wikipedia to MathWorld and then 
to DLMF. In our analyses we use several frequency data, of both a network- 
wide nature as well as node-related, aiming not only to describe the libraries' 
properties as such data reveal them, but also to discover how these properties 
relate to the libraries' robustness in the face of accidental or intentional loss of 
material and to their ease of search in response to text queries. 

^1 http://en.wikipedia.org/wiki/Portal:Mathematics.
^2 http://mathworld.wolfram.com.
^3 http://dlmf.nist.gov.
What has turned up is a collection of results that both sets the three
mathematical libraries apart from the wider English-language Wikipedia and from
the much wider Web, and at the same time groups the three libraries together 
insofar as they share important properties. Some of the most significant results 
include the characteristic that a very large fraction of each library's pages are 
packed together in the sense of mutual reachability; the presence of clear signs 
that all three libraries result from decisions regarding the deployment of links 
that are leveraged by technical knowledge (rather than, say, some nontechni- 
cal measure of a page's relevance, such as popularity); and the discovery of 
successful criteria for guiding text search within the libraries' pages that differ 
significantly from those most commonly used (e.g., by Google). 

We proceed in the following manner. First, in Section 2, we introduce the
five directed graphs that we use in all analyses (two for Wikipedia, two for
MathWorld, one for DLMF) and also some basic notation. We then move,
respectively in Sections 3 and 4, to a study of these graphs' global and local
network-theoretic features. Section 5 is dedicated to an analysis of the five
graphs' robustness when nodes are lost either as a result of some random process
or as a deterministic function of the graphs' local features. We continue with
Section 6, where we investigate the effect of such features in the ranking of
nodes when responding to text queries. We conclude in Section 7.

2 Five directed graphs 

In all three libraries it is possible to reach the technical-content pages by navi- 
gating through a hierarchy of specialized subdivisions from the main portal (the 
so-called category pages). Once the content pages are reached, further naviga- 
tion is possible through the links that lead from one such page to another. Each 
of the directed graphs with which we work has a node for each content page and 
directed edges that reflect inter-page links. In all cases, links leading from a page 
to itself are ignored when building the graph, so no self-loops exist. Similarly, 
should multiple links exist from a page to another, only one edge is created in 
the graph between the corresponding nodes. 
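The construction rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_graph` and the toy link list are hypothetical and do not come from the libraries' data or from any tooling used for the study.

```python
def build_graph(links):
    """Return {page: set of out-neighbors} from (source, target) page links."""
    out = {}
    for src, dst in links:
        out.setdefault(src, set())
        out.setdefault(dst, set())
        if src != dst:           # links from a page to itself are ignored
            out[src].add(dst)    # a set keeps at most one edge per ordered pair
    return out


# Toy input (hypothetical): one repeated link and one self-link.
links = [("a", "b"), ("a", "b"), ("b", "a"), ("c", "c"), ("b", "c")]
graph = build_graph(links)
n = len(graph)                                        # number of nodes
m = sum(len(targets) for targets in graph.values())   # number of edges
print(n, m)
```

Storing out-neighbors in sets makes the "only one edge per ordered pair" rule automatic, at the cost of losing link multiplicities, which the graphs described here do not use.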

In the case of Wikipedia and MathWorld, links can be categorized into those 
appearing in a page's main text and those that are given in the page's "See also" 
section when it exists. We perceive these two link types as playing entirely 
different roles. While in-text links are generally meant to clarify some of the
terms used in the page, being therefore meant for quick side lookups before
continuing on the main text, See-also links are used to point to pages where
related material is to be found. For this reason, we use two different graphs
for each of Wikipedia and MathWorld. They both have the same node set, but 
their edge sets differ, one reflecting in-text as well as See-also links, the other 
reflecting See-also links only. 

The case of DLMF requires no such special treatment. Although its pages,
too, contain special, "Referenced by" links, such links are simply antiparallel
versions of the library's non-Referenced-by links. That is, page a contains a
non-Referenced-by link to page b if page b contains a Referenced-by link to page
a. Referenced-by links in DLMF are therefore redundant as far as building its
directed graph is concerned. They are for this reason ignored.

Table 1: Online libraries and corresponding directed graphs.

Library                      Download period   Directed graph
Wikipedia                    September 2010    W
Wikipedia, See-also links    September 2010    W'
MathWorld                    August 2009       M
MathWorld, See-also links    August 2009       M'
DLMF                         September 2010    D

These observations amount to five different graphs with which to work, as
summarized in Table 1. In the table, for each of the libraries and, when
applicable, taking See-also links into account, we give the time frame within
which the content pages were downloaded and the notation we use to refer to the
corresponding graph.

Some additional basic notation to be used throughout is the following. Given
the graph under consideration, we let $n$ stand for its number of nodes and $m$
for its number of edges. For node $i$, $I_i$ is its set of in-neighbors (nodes
from which edges are directed toward $i$) and $O_i$ its set of out-neighbors
(nodes toward which edges are directed from $i$). Its in-degree is
$\delta_i^+ = |I_i|$, its out-degree is $\delta_i^- = |O_i|$, and its number of
neighbors when edge directions are disregarded (henceforth referred to simply as
its degree) is $\delta_i = |I_i \cup O_i| \le \delta_i^+ + \delta_i^-$. Clearly,
it holds that $\max\{\delta_i^+, \delta_i^-\} \le \delta_i$. For any two nodes
$i$ and $j$, $d_{ij}$ is the distance from $i$ to $j$, that is, the number of
edges on a shortest directed path leading from $i$ to $j$. If none exists, then
$d_{ij} = \infty$. We let $R_i$ be the set of nodes $j$ such that
$0 < d_{ij} < \infty$. Note that $R_i = \emptyset$ if and only if node $i$ is a
sink, i.e., $O_i = \emptyset$.
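To make the notation concrete, the following Python sketch computes in-neighbor sets, the three degrees, breadth-first distances, and the reachable set of a node on a toy graph. The helper names (`in_neighbors`, `distances_from`, `reach_set`) and the three-node graph are illustrative assumptions only.

```python
from collections import deque


def in_neighbors(out):
    """Invert an out-neighbor map to obtain the in-neighbor sets."""
    inn = {i: set() for i in out}
    for i, targets in out.items():
        for j in targets:
            inn[j].add(i)
    return inn


def distances_from(out, i):
    """Breadth-first distances from i; unreachable nodes are absent (infinite)."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for v in out[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def reach_set(out, i):
    """Nodes j at a finite, strictly positive distance from i."""
    return {j for j, d in distances_from(out, i).items() if d > 0}


# Toy graph (hypothetical): a <-> b, b -> c; node c is a sink.
out = {"a": {"b"}, "b": {"a", "c"}, "c": set()}
inn = in_neighbors(out)
in_deg = {i: len(inn[i]) for i in out}
out_deg = {i: len(out[i]) for i in out}
deg = {i: len(inn[i] | out[i]) for i in out}   # edge directions disregarded
print(deg, reach_set(out, "a"), reach_set(out, "c"))
```

Node b illustrates the inequalities in the text: its in-degree is 1, its out-degree is 2, and its degree is 2 because a is both an in- and an out-neighbor.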

3 Global features 

We give six global features for each graph. The first two are straightforward
and provide simple relationships between the graph's number of nodes, $n$, and
its number of edges, $m$. The first one is simply the graph's mean in-degree,
denoted by $\delta^+$ and given by

$$\delta^+ = \frac{m}{n} \tag{1}$$

(necessarily equal to the graph's mean out-degree). The second feature is the
graph's mean degree. Denoting it by $\delta$, we have

$$\delta = \frac{1}{n}\sum_i \delta_i \le \frac{1}{n}\sum_i (\delta_i^+ + \delta_i^-) = 2\delta^+. \tag{2}$$

Both $\delta^+$ and $\delta$ work as indicators of the graph's edge density
relative to its number of nodes. The value of $\delta$, in particular, may swing
toward either of its bounds, $\delta^+$ and $2\delta^+$, indicating in the
former case that every edge's antiparallel counterpart is also present in the
graph and in the latter case that none is. On average, then, the fraction of
$\delta$ corresponding to antiparallel edge pairs is given by
$(2\delta^+ - \delta)/\delta = 2\delta^+/\delta - 1$.

Our next global feature is the fraction $S$ of $n$ that corresponds to the nodes
inside the graph's largest strongly connected component (GSCC henceforth,
where G is for "giant"). A strongly connected component is either a singleton
whose only member, say node $i$, is such that $i \notin R_j$ for every node
$j \in R_i$ (no directed path exists back from any node that can be reached from
$i$ through a directed path), or a larger set that is maximal with respect to
the property that $j \in R_i$ for any two of its members $i$ and $j$ such that
$j \ne i$. In the latter case, then, a directed path exists between any two
distinct nodes inside the strongly connected component. Informally, the value of
$S$ can be regarded as an indication of the network's "degree of acyclicity."
If the graph is acyclic, then all its strongly connected components are
singletons and $S = 1/n$. The other extreme corresponds to the case in which all
nodes are in the GSCC, so $S = 1$.

The fourth and fifth global features are both related to classifying a graph
vis-a-vis the so-called small-world criteria [461 H], namely small distances and
large transitivity. We address the first criterion by computing the average
distance between any two distinct nodes, so long as only finite distances are
considered. We denote this average by $\ell$, which is then such that

$$\ell = \frac{1}{N}\sum_i \sum_{j \in R_i} d_{ij}, \tag{3}$$

where $N$ is the number of $i, j$ pairs contributing to the double summation. As
for the second criterion, that of transitivity, we follow the usual trend of
disregarding edge directions and computing the resulting graph's clustering
coefficient in its most common formulation [39]. If $C$ is the clustering
coefficient, then this formulation lets $C = 3t/T$, where both $t$ and $T$ refer
to node triples in the graph, e.g., $i, j, k$. The value of $t$ is meant to
reflect the number of triangles in the graph, that is, those triples in which an
edge connects $i$ and $j$, another connects $j$ and $k$, and yet another
connects $i$ and $k$. The value of $T$, on the other hand, counts the triples
that are arranged as three-node (two-edge) paths. The factor 3 in the numerator
of the ratio defining $C$ reflects the fact that there are three triples of the
latter type for each triangle in the graph. It follows that $0 \le C \le 1$ (no
transitivity through full transitivity). In our analysis of each graph's
clustering coefficient $C$, we present it side-by-side with the value it would
have if every node $i$ continued to have the same degree $\delta_i$ but the
connections were made at random [39]. This value, denoted by $C'$, is given by

$$C' = \frac{(\delta^{(2)} - \delta)^2}{n\delta^3}, \tag{4}$$

where $\delta^{(2)} = (1/n)\sum_i \delta_i^2$.
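The degree-based and transitivity-based features can be illustrated with a short Python sketch. It assumes the graph is stored as a map from each node to its set of out-neighbors and has at least one edge; the function names and the four-node example are hypothetical, and the random-connection value follows the formula for $C'$ as reconstructed in Eq. (4).

```python
from itertools import combinations


def undirected_neighbors(out):
    """Neighbor sets when edge directions are disregarded."""
    nbrs = {i: set(targets) for i, targets in out.items()}
    for i, targets in out.items():
        for j in targets:
            nbrs[j].add(i)
    return nbrs


def global_features(out):
    n = len(out)
    m = sum(len(targets) for targets in out.values())
    nbrs = undirected_neighbors(out)
    degrees = [len(nbrs[i]) for i in out]
    mean_in = m / n                              # Eq. (1)
    mean_deg = sum(degrees) / n                  # Eq. (2)
    antiparallel = 2 * mean_in / mean_deg - 1    # fraction due to antiparallel pairs
    # Each triangle is seen once from each of its three nodes, so this is 3t.
    closed = sum(1 for i in out
                 for j, k in combinations(nbrs[i], 2) if k in nbrs[j])
    triples = sum(d * (d - 1) // 2 for d in degrees)    # T: two-edge paths
    c = closed / triples if triples else 0.0
    second_moment = sum(d * d for d in degrees) / n
    c_random = (second_moment - mean_deg) ** 2 / (n * mean_deg ** 3)  # Eq. (4)
    return {"mean_in": mean_in, "mean_deg": mean_deg,
            "antiparallel": antiparallel, "C": c, "C_random": c_random}


# Toy graph (hypothetical): a <-> b, a -> c, b -> c, c -> d.
out = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"d"}, "d": set()}
features = global_features(out)
print(features)
```

In this example the single antiparallel pair (a, b) gives $2\delta^+/\delta - 1 = 0.25$, and the one triangle among five two-edge paths gives $C = 3/5$.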
Our last global feature is in fact a series of four assortativity coefficients.
Each one is the Pearson correlation coefficient of two length-$m$ sequences of
numbers. If $\alpha_1, \alpha_2, \ldots, \alpha_m$ and
$\beta_1, \beta_2, \ldots, \beta_m$ are the sequences, $\mu_\alpha$ and
$\mu_\beta$ are the corresponding means, and $\sigma_\alpha$ and $\sigma_\beta$
are the corresponding standard deviations, this coefficient is

$$r_{\alpha,\beta} = \frac{(1/m)\sum_e \alpha_e\beta_e - \mu_\alpha\mu_\beta}{\sigma_\alpha\sigma_\beta}. \tag{5}$$

The original assortativity coefficient is obtained by letting
$\alpha_e = \delta_i^-$ and $\beta_e = \delta_j^+$ for $e$ the edge directed
from $i$ to $j$ [35]. That is, it measures how correlated the out-degrees of the
edges' tail nodes are with the in-degrees of the edges' head nodes. A shorthand
for this formulation is to use out, in in place of $\alpha, \beta$. We get the
other three variations by selecting the other possible combinations
(in, out; out, out; in, in) [23, 42].
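As a hedged illustration of Eq. (5), the sketch below computes any of the four coefficients for a graph stored as a map from nodes to out-neighbor sets. The function name `assortativity`, its `tail_kind`/`head_kind` parameters, and the example graph are assumptions made for this sketch; the code also assumes that neither sequence is constant, since the standard deviations appear in the denominator.

```python
from math import sqrt


def assortativity(out, tail_kind="out", head_kind="in"):
    """Pearson correlation, over edges (i, j), of a degree of i and a degree of j."""
    in_deg = {i: 0 for i in out}
    for targets in out.values():
        for j in targets:
            in_deg[j] += 1
    out_deg = {i: len(out[i]) for i in out}
    degree = {"in": in_deg, "out": out_deg}
    edges = [(i, j) for i in out for j in out[i]]
    alpha = [degree[tail_kind][i] for i, _ in edges]   # tail-node values
    beta = [degree[head_kind][j] for _, j in edges]    # head-node values
    m = len(edges)
    mu_a = sum(alpha) / m
    mu_b = sum(beta) / m
    sd_a = sqrt(sum(a * a for a in alpha) / m - mu_a ** 2)
    sd_b = sqrt(sum(b * b for b in beta) / m - mu_b ** 2)
    cross = sum(a * b for a, b in zip(alpha, beta)) / m
    return (cross - mu_a * mu_b) / (sd_a * sd_b)


# Toy graph (hypothetical): a -> b, a -> c, b -> c, c -> a, d -> a.
out = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}, "d": {"a"}}
r_out_in = assortativity(out, "out", "in")
r_in_out = assortativity(out, "in", "out")
print(r_out_in, r_in_out)
```

Passing the other combinations of `"in"` and `"out"` yields the remaining variations mentioned in the text.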

The global features of the graphs in Table 1 are shown in Tables 2 and 3,
which include an additional row for the directed graph, denoted by W^+, that
corresponds to the entire English-language Wikipedia of a relatively recent past
[16, 48]. Table 2, moreover, contains one further row for the whole Web, now
based on data from an older past [Hjlfl. The corresponding directed graph
is denoted by W*. Not all global features are available for W^+ or W*, as
indicated by blank entries in the tables. Graphs are arranged in Tables 2 and 3
in nonincreasing order of $n$, then in decreasing order of $m$.

The data shown in Table 2 indicate that edge density relative to the number
of nodes, as given by $\delta^+$, has the same order of magnitude for most
graphs, the exception being W', the Wikipedia graph based exclusively on
See-also links, whose $\delta^+$ value is one order of magnitude lower.
Wikipedia contributors to the mathematical pages, therefore, seem to deploy
See-also links considerably less methodically than those who contribute to
MathWorld. It is also worth noting that the five mathematics-related graphs have
fairly different values for the ratio $2\delta^+/\delta - 1$, pointing at W' as
the graph with the fewest antiparallel edge pairs contributing to degrees on
average, and to M', the MathWorld graph based on See-also links, as having the
most. Once again, then, MathWorld contributors appear more meticulous at
providing cross-referencing information of the See-also form.

One of the most striking contrasts in Table 2 concerns the value of $S$, the
size of the graph's GSCC relative to $n$. While for the Web graph W* our
current best estimate places about 28% of the nodes inside the GSCC, for the
Wikipedia graph W^+ and most of the mathematical-library graphs we have
been considering the GSCC encompasses substantially more nodes (between 62
and 82%). The exception, once again, occurs on account of graph W', whose
GSCC is sized at a mere 2% of the nodes, and which as we have noted is only
very sparsely interconnected by the See-also links.
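For readers who wish to reproduce the GSCC fraction $S$ on a small graph, the following Python sketch finds strongly connected components by intersecting forward and backward reachability. This is a deliberately simple approach chosen for clarity, not the method used to obtain the figures in Table 2; a linear-time algorithm such as Tarjan's would be preferable for graphs of the sizes considered here. All names and the toy graphs are assumptions.

```python
def reachable(adj, start):
    """Set of nodes reachable from start (start included) along adj."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def gscc_fraction(out):
    """Fraction of the nodes inside the largest strongly connected component."""
    inn = {i: set() for i in out}
    for i, targets in out.items():
        for j in targets:
            inn[j].add(i)
    unassigned = set(out)
    largest = 0
    while unassigned:
        i = next(iter(unassigned))
        # Nodes that i can reach and that can reach i form i's component.
        component = reachable(out, i) & reachable(inn, i)
        largest = max(largest, len(component))
        unassigned -= component
    return largest / len(out)


# Toy graphs (hypothetical): a 3-cycle with a tail, and an acyclic path.
cyclic = {"a": {"b"}, "b": {"c"}, "c": {"a", "d"}, "d": {"e"}, "e": set()}
acyclic = {"a": {"b"}, "b": {"c"}, "c": set()}
print(gscc_fraction(cyclic), gscc_fraction(acyclic))
```

The acyclic example returns $1/n$, matching the lower extreme described in Section 3.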

* Slightly more recent data seem to indicate an $S$ value of roughly 0.33 for a
similarly sized Web [19], but no estimate is given for $\ell$.
Table 2: Global features: mean in- or out-degree ($\delta^+$), mean degree
($\delta$) and the resulting value of $2\delta^+/\delta - 1$, fraction of $n$
within the GSCC ($S$), average distance between distinct nodes ($\ell$), and
clustering coefficient ($C$, along with the value, $C'$, it would have if
connections were random).

Graph  n            m              $\delta^+$  $\delta$  $2\delta^+/\delta-1$  $S$   $\ell$  $C$    $C'$
W*     203 549 046  1 466 000 000  7.20                                        0.28  16.18
W^+    339 834      5 278 037      15.53                                       0.82  4.90
W      37 723       688 589        18.25       30.62     0.19                  0.80  4.11    0.055  $7.59 \times 10^{-4}$
W'     37 723       21 503         0.57        1.04      0.09                  0.02  15.27   0.061  $4.96 \times 10^{-8}$
M      15 095       92 648         6.14        9.72      0.26                  0.78  5.32    0.048  $5.18 \times 10^{-4}$
M'     15 095       46 965         3.11        4.45      0.40                  0.62  7.45    0.093  $1.77 \times 10^{-4}$
D      908          7 527          8.29        12.81     0.29                  0.81  3.79    0.062  0.011


Table 3: Global features: assortativity coefficients.

Graph  $r_{out,in}$    $r_{in,out}$    $r_{out,out}$   $r_{in,in}$
W^+    -0.150
W      -0.071          0.075           -0.074          -0.022
W'     0.041           0.094           0.070           0.028
M      -0.037          -0.018          -0.015          -0.019
M'     -0.054          -0.031          -0.058          -0.036
D      -0.169          0.006           -0.053          -0.043



The remaining data in Table 2 refer to $\ell$ and to $C$, a graph's average path
length (in the directed sense) and clustering coefficient (in the undirected
sense), respectively. We first note that, for six of the seven graphs, $\ell$ is
proportional to $\ln n$ by a constant of the order of $10^{-1}$, the exception
being W', for which the proportionality constant is roughly 1.45 (this comes
from the substantially larger distances in comparison to W, as expected from the
substantially lower $\delta^+$ value). In all cases, however, distances are on
average very small given the value of $n$, so all seven graphs qualify as
small-world structures. Moreover, as is usually but not always the case [35], in
all five mathematics-related graphs the value of $C$ is noticeably larger than
that of $C'$. In fact, except for the DLMF graph D, $C$ surpasses $C'$ by a
factor of at least two orders of magnitude. The construction of D, which has
$C \approx 5.64C'$, seems to have been guided by forces that prevent the
formation of triangles more than they do in the other four cases. One possible
explanation is that, in comparison to Wikipedia or MathWorld, each DLMF page
contains substantially more material, which in fact is reflected in the low
number of nodes of graph D.

Table 3 contains all four assortativity coefficients for all five
mathematics-related graphs and the unrestricted Wikipedia graph W^+. The vast
majority of all values is of the order of $10^{-2}$ at most, being therefore
sufficiently near zero for the sequences involved to be taken as uncorrelated.
In general this is indicative either of a random pattern of connections (which
is not the case) or that criteria for edge deployment are at work that make no
reference whatsoever to in- or out-degrees (which is more plausibly the case).
Curiously, though, the same holds also for the only two exceptions, W^+ and D,
for which the moderately negative but nonnegligible value of $r_{out,in}$ is
suggestive that in these two graphs connections are effected in such a way that
promotes a small but noticeable degree of disassortative mixing of tail nodes'
out-degrees with head nodes' in-degrees. That is, there is a slight tendency of
nodes with larger (smaller) out-degrees to connect out to nodes with smaller
(larger) in-degrees. This tendency is quantified very similarly by $r_{out,in}$
for both W^+ and D ($-0.150$ in the former case, $-0.169$ in the latter).
Perhaps the aforementioned fact that the typical DLMF page contains more
material than the other libraries' pages somehow makes D resemble W^+ in this
one aspect.
4 Local features

Presenting a graph's local features requires that we value each feature of
interest for each node and then provide some probability distribution of that
feature over the entire graph. In this section we work with the feature's
complementary cumulative distribution (CCD henceforth), denoted by $F(z)$ for an
admissible feature value $z$, which is the probability that a randomly chosen
node has a feature value that surpasses $z$. We compute $F(z)$ as the fraction
of $n$ representing the nodes for which the feature is valued beyond $z$.
Clearly, if for a graph the feature in question is never valued beyond $Z$, then
$F(z) = 0$ for all $z \ge Z$.
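A minimal Python sketch of the empirical CCD follows. It evaluates $F(z)$ at each distinct observed feature value; the function name `ccd` and the sample in-degree list are illustrative assumptions.

```python
def ccd(values):
    """Pairs (z, F(z)) for each distinct z, with F(z) the fraction of values > z."""
    n = len(values)
    ordered = sorted(values)
    points = []
    for idx, z in enumerate(ordered):
        is_last_copy = idx + 1 == n or ordered[idx + 1] != z
        if is_last_copy:
            # Exactly n - idx - 1 values lie strictly beyond z.
            points.append((z, (n - idx - 1) / n))
    return points


# Hypothetical in-degrees of a four-node graph.
in_degrees = [0, 1, 1, 3]
points = ccd(in_degrees)
print(points)   # [(0, 0.75), (1, 0.25), (3, 0.0)]
```

Plotting these pairs on doubly logarithmic axes is the usual way to look for the straight-line signature of a power law, after discarding points with $z = 0$ or $F(z) = 0$.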

The most widely studied local features are a node's in-degree, out-degree,
and degree. Not only have they been measured in a variety of domains, but
knowledge of how they are distributed can be used in the study of many other
network properties [40]. These features are the first three we study, as
characterizations of in- and out-degrees have over the years led to important
discoveries regarding the Web and the unrestricted Wikipedia. Specifically, we
know from at least two independent sources operating on different data that
in-degrees in the Web graph (our W* graph in one case, a different version in
the other) are distributed according to a power law [2, 14, 19]. That is, the
probability that a randomly chosen node has in-degree $k > 0$ is proportional to
$k^{-\alpha}$ (so the corresponding CCD is approximately proportional to
$k^{1-\alpha}$) for $\alpha \approx 2.1$. Similar power laws have also been
reported for the graph's out-degrees, but in this case there seems to be some
disagreement [TH]. As for the Wikipedia graph W^+, its in-degree, out-degree,
and degree distributions have all been found to follow power laws, of exponents
$-2.21$, between $-2.65$ and $-2$, and $-2.37$, respectively [TSJIH]. Power laws
are inherently scale-free [5Bj, and their appearance in graphs such as W* and
W^+ has been explained particularly well by the mechanism of edge deployment
known as preferential attachment [331 ISl 13 ES].

The additional local features that we consider are the ones given in Table 4.
Four of them ($B_i$, $S_i$, $C_i$, and $G_i$) are measures of node $i$'s
centrality in the graph, being therefore related to shortest directed paths in
which $i$ participates in some way. The remaining three are related to search
mechanisms on the Web. They are measures of how node $i$ qualifies as a hub
($y_i$) or an authority ($x_i$) in the HITS (Hyperlink-Induced Topic Search)
mechanism, and the node's page rank ($p_i$), which underlies Google searches.

The centrality features can be computed through variations of a well-known
algorithm [12], and similarly the other three, though requiring iterative
updates for convergence. In the case of the HITS-related features, first every
$x_i$ and $y_i$ is initialized to 1. Then the $x_i$'s and $y_i$'s are
alternately updated via the rules given in Table 4. The updating of the $x_i$'s
is followed by a normalization of the resulting values so that
$\sum_i x_i^2 = 1$, which is achieved by dividing each $x_i$ by the Euclidean
norm of the vector of components $x_1, x_2, \ldots, x_n$. The updating of the
$y_i$'s proceeds similarly. After convergence, all features are normalized so
that $\sum_i x_i = \sum_i y_i = 1$. For the page-rank feature, once again every
$p_i$ is initialized to 1 and the update rule given in Table 4 is iteratively
applied until convergence, at which time all $p_i$'s are normalized so that
$\sum_i p_i = 1$.

Table 4: Additional local features for node $i$. $\sigma_{jk}$ is the number of
shortest directed paths that lead from $j$ to $k$, while $\sigma_{jk}(i)$ counts
only the ones that go through $i$.

Designation                        Formula
Betweenness centrality             $B_i = \sum_{j \ne i} \sum_{k \ne i} \sigma_{jk}(i)/\sigma_{jk}$
Stress centrality                  $S_i = \sum_{j \ne i} \sum_{k \ne i} \sigma_{jk}(i)$
Closeness centrality               $C_i = 1/\sum_{j \in R_i} d_{ij}$ if $R_i \ne \emptyset$; 0, otherwise
Graph centrality                   $G_i = 1/\max_{j \in R_i} d_{ij}$ if $R_i \ne \emptyset$; 0, otherwise
HITS update rule for hubs          $y_i := \sum_{j \in O_i} x_j$
HITS update rule for authorities   $x_i := \sum_{j \in I_i} y_j$
Page-rank update rule with
damping factor 0.85                $p_i := 0.15 + 0.85 \sum_{j \in I_i} p_j/\delta_j^-$

For the
two HITS-related features and page rank, our criterion for convergence has been 
that, for all nodes, the two latest feature values differ from each other by some 
quantity in the interval [— lO"""^^, 10~^^]. 
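The iterative procedures described above can be sketched as follows. This is an illustration under explicit assumptions, not the authors' implementation: the tolerance `tol=1e-10` and the iteration cap `max_iter` are hypothetical defaults (the exact convergence interval is not legible in this copy), the hub update is assumed to use the freshly normalized authority values, and the function names and toy graph are invented for the example.

```python
from math import sqrt


def in_neighbors(out):
    inn = {i: set() for i in out}
    for i, targets in out.items():
        for j in targets:
            inn[j].add(i)
    return inn


def rescale(values, total):
    """Divide every value by total, leaving values unchanged if total is zero."""
    return {i: v / total for i, v in values.items()} if total else dict(values)


def hits(out, tol=1e-10, max_iter=10000):
    """Authority (x) and hub (y) values, each normalized to sum to 1."""
    inn = in_neighbors(out)
    x = {i: 1.0 for i in out}
    y = {i: 1.0 for i in out}
    for _ in range(max_iter):
        new_x = {i: sum(y[j] for j in inn[i]) for i in out}
        new_x = rescale(new_x, sqrt(sum(v * v for v in new_x.values())))
        new_y = {i: sum(new_x[j] for j in out[i]) for i in out}
        new_y = rescale(new_y, sqrt(sum(v * v for v in new_y.values())))
        converged = all(abs(new_x[i] - x[i]) <= tol and
                        abs(new_y[i] - y[i]) <= tol for i in out)
        x, y = new_x, new_y
        if converged:
            break
    return rescale(x, sum(x.values())), rescale(y, sum(y.values()))


def page_rank(out, damping=0.85, tol=1e-10, max_iter=10000):
    """Page rank via p_i := (1 - damping) + damping * sum of p_j / out-degree(j)."""
    inn = in_neighbors(out)
    p = {i: 1.0 for i in out}
    for _ in range(max_iter):
        new_p = {i: (1 - damping) +
                 damping * sum(p[j] / len(out[j]) for j in inn[i]) for i in out}
        converged = all(abs(new_p[i] - p[i]) <= tol for i in out)
        p = new_p
        if converged:
            break
    return rescale(p, sum(p.values()))


# Toy graph (hypothetical): b -> a and c -> a.
star = {"a": set(), "b": {"a"}, "c": {"a"}}
x, y = hits(star)
p = page_rank(star)
print(x, y, p)
```

On this graph node a is the only authority, b and c share the hub weight equally, and a's page rank before normalization settles at $0.15 + 0.85 \times 0.3 = 0.405$.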

CCD plots for the local features are given in Figure 1 (in-degree, out-degree,
and degree), Figure 2 (centralities), and Figure 3 (hub, authority, and page
rank). One striking characteristic they all share is that no feature of any
library seems to be expressible as a clear power law for any significant number
of orders of magnitude. For example, although we have found the in-degree CCD
for DLMF to be given approximately by a power law of $\alpha = 2.47$, this seems
reasonable only for one order of magnitude (roughly between 10 and 100). In the
case of Figure 1, in particular, this widespread absence of a power law works to
confirm the expectation that, in such a specialized domain as that of the three
libraries, it is expertise, rather than some popularity-based criterion such as
preferential attachment, that guides the establishment of connections.

In Figure 2, the CCD plots for the $C_i$ and $G_i$ values share the peculiar
property that all nodes are concentrated inside three relatively narrow
centrality intervals. For each of the five graphs, first are the sink nodes,
those for which $R_i = O_i = \emptyset$, having $C_i = G_i = 0$. Then comes what
in almost all cases is the most densely populated interval. Nodes in this
interval have the relatively small centrality values typically associated with
relatively large distances to the nodes in $R_i$. They are members of the
graph's largest so-called in-component, which encompasses the GSCC and all nodes
from which at least one directed path leads to the GSCC. This explains the
single exception, which once again concerns the small-GSCC graph of the
Wikipedia library with only See-also links (W'). The third centrality interval
contains the remaining nodes and is characterized by centrality values that in
almost all cases bespeak relatively small distances to the nodes in $R_i$. These
nodes lie outside the graph's largest in-component and, once again, the single
exception is relative to W'.

5 Local features and GSCC disruption 

In the graphs we have been studying, as in all graphs reflecting real-world
networks, the existence of the GSCC is merely a matter of observation: we simply
look for the graph's strongly connected components and select the largest one.
In a more abstract sense, however, random-graph models of networks have been
studied for the existence of such components under a growth regime from relative
sparseness to relative denseness (that is, as the graph's number of nodes
and/or edges is changed so that it becomes denser). Such studies were initiated
with the Erdős-Rényi (ER) random graphs [21], which are undirected and
characterized by a Poisson distribution of node degrees. Since edges do not have
directions in the ER model, one looks for weakly (rather than strongly)
connected components, or simply connected components, and for the GCC (rather
than the GSCC). It turns out that increasing $\delta$ (the mean degree) past 1
as the graph becomes denser gives sudden rise to the GCC as a connected
component that for the first time is set apart from the others by virtue of its
size [22]. A similar phenomenon also occurs in many other random-graph models,
including their directed variations with regard to the rise of the GSCC
[30, 33, 34, 20, 40].

Figure 1: CCD plots for the $\delta_i^+$ (a), $\delta_i^-$ (b), and $\delta_i$
(c) values of W (Wikipedia), W' (Wikipedia, See also), M (MathWorld), M'
(MathWorld, See also), and D (DLMF).

Figure 2: CCD plots for the $B_i$ (a), $S_i$ (b), $C_i$ (c), and $G_i$ (d)
values of W (Wikipedia), W' (Wikipedia, See also), M (MathWorld), M' (MathWorld,
See also), and D (DLMF).

Another similar phenomenon, often called site percolation, is the breakdown
of the GCC or GSCC when nodes are continually isolated from the rest of the
graph by the removal of all edges incident to them. In the case of ER graphs,
for example, the GCC breakdown is expected to happen after a fraction
$1 - 1/\delta$ of the nodes has been randomly isolated [9], provided
$\delta > 1$ to begin with (i.e., provided there really is a GCC initially).
Results of this sort have been obtained also for undirected graphs with degrees
obeying a scale-free distribution. However, unlike the ER graphs, with their
degrees closely clustered about the mean, now there may exist high-degree nodes,
so it makes sense to look at targeted as well as random node-isolation
processes. As it turns out, for $\alpha = 2.5$ (which is thought to describe the
Internet graph) the GCC is only expected to disappear after at least 99% of the
nodes have been randomly isolated, although for relatively small graphs this can
be as low as about 80% [TF . Targeting highest-degree nodes first, though,
implies that isolating fewer than 20% of the nodes is expected to suffice [3].
We know of no similar studies for directed random-graph models regarding the
impact of node isolation on the graph's GSCC. So, despite the figures given
above, we are essentially left without any meaningful clue as to what to expect
when we conduct node isolation in our five mathematical libraries.

The results we present in this section describe the evolution of $S$, the
fraction of $n$ inside the GSCC, as nodes are isolated either randomly or
targeting first the non-isolated node for which a specific local feature is
highest. In the former case we provide the average value obtained from ten
independent trials. As for the local feature in question, we report on all ten
discussed in Section 4. In all cases, node isolation is performed until no
strongly connected component has more than one node. When isolation stops, then,
all remaining nodes are either isolated (no in- or out-neighbors) or part of an
acyclic portion of the current graph.

Figure 3: CCD plots for the $y_i$ (a), $x_i$ (b), and $p_i$ (c) values of W
(Wikipedia), W' (Wikipedia, See also), M (MathWorld), M' (MathWorld, See also),
and D (DLMF).
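The targeted node-isolation experiment can be sketched in Python as below. Several points are assumptions made for illustration: feature values are computed once and kept fixed while nodes are isolated (the text does not say whether they are recomputed), the breakdown fraction is taken to be the fraction of nodes isolated when no strongly connected component has more than one node, and the simple reachability-based component search is chosen for brevity rather than speed. Names and the toy graph are hypothetical.

```python
def reachable(adj, start):
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def largest_scc_size(out):
    inn = {i: set() for i in out}
    for i, targets in out.items():
        for j in targets:
            inn[j].add(i)
    unassigned = set(out)
    largest = 0
    while unassigned:
        i = next(iter(unassigned))
        component = reachable(out, i) & reachable(inn, i)
        largest = max(largest, len(component))
        unassigned -= component
    return largest


def isolate(out, node):
    """Remove every edge incident to node, keeping the node itself."""
    out[node] = set()
    for targets in out.values():
        targets.discard(node)


def breakdown_fraction(out, score):
    """Fraction of nodes isolated, highest score first, until the GSCC is gone."""
    work = {i: set(targets) for i, targets in out.items()}   # leave input intact
    order = sorted(work, key=lambda i: score[i], reverse=True)
    isolated = 0
    for node in order:
        if largest_scc_size(work) <= 1:
            break
        isolate(work, node)
        isolated += 1
    return isolated / len(work)


# Toy graph (hypothetical): cycle a -> b -> c -> a plus c <-> d.
out = {"a": {"b"}, "b": {"c"}, "c": {"a", "d"}, "d": {"c"}}
by_degree = {"c": 3, "a": 2, "b": 1.5, "d": 1}   # hypothetical feature values
unlucky = {"d": 4, "a": 3, "b": 2, "c": 1}
print(breakdown_fraction(out, by_degree), breakdown_fraction(out, unlucky))
```

A random-isolation baseline can be obtained from the same function by passing randomly drawn scores and averaging over independent trials.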

Our results appear in Figure 4, where the breakdown fractions for random
isolation are seen to be in the [0.4, 0.6] interval for the non-See-also graphs,
along with roughly 0.3 for M' and less than 0.01 for W'. If we once again except
W', with its frail GSCC, and maybe M' as well, we are left with figures that
indicate what seem to be quite resilient GSCCs in M, W, and D. As we turn to the
isolation of nodes following one of the local features, the data in Figure 4
reveal that the specific feature in question is practically irrelevant, with the
exception of graph centrality and closeness centrality in all cases but that of
W'. These two features are, respectively, the second and third least effective
means we have found to break the GSCC (following the random method, which is the
least effective). Except for graph and closeness centrality, the data also
reveal a breakdown fraction of about 0.2 for W and then a little above 0.1 for
M, then a little below 0.1 for M' and, finally, less than 0.01 for W'. Targeting
nodes based on any one of these local features, then, reveals a dividing line
between the See-also and non-See-also graphs as well, with W' once more at the
lowest end and M' in between W' and the non-See-also group.

6 Local features and text search 

Google's search engine grew out of the notion of page rank, one of the ten
local features examined in Section 4. Page rank, however, is no more of a node
descriptor than any of the other local features, so in principle it is at least
conceivable that any of the others might be used instead. We explore this
possibility in this section for each of the five directed graphs W, W', M, M',
and D, regarding the search, in their nodes' texts, for a number of the top
keywords in mathematics as reported at the Microsoft Academic Search (MAS)
site^5 as of November 1, 2012.

We follow the standard method outlined in [5]. For each graph and local 
feature, and for a given query (one of the aforementioned keywords), this method 
begins by identifying a list A of answer nodes (sorted by nonincreasing feature 
value) as well as a set R of relevant nodes. It then proceeds to calculating the 
well-known Precision and Recall metrics for each k = 1, 2, ..., |A|. These are 
given by the fraction of k corresponding to the nodes in the size-k prefix of list 
A that are also in R (Precision) and the fraction of |R| that corresponds to these 
shared nodes (Recall). Note that the more relevant nodes are ranked first in 
A, the higher the Precision values obtained over a larger stretch of Recall values. 
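
The two metrics just defined can be sketched in a few lines of Python. This is a minimal illustration, not the code used for our experiments: the function name `precision_recall` is hypothetical, `answers` stands for the sorted list A, and `relevant` stands for the set R, which is assumed to be nonempty.

```python
def precision_recall(answers, relevant):
    """Return [(Precision, Recall)] for every prefix size k = 1, ..., len(answers).

    answers  -- list A, already sorted by nonincreasing feature value
    relevant -- set R of relevant nodes (must be nonempty)
    """
    relevant = set(relevant)
    if not relevant:
        raise ValueError("R must be nonempty")
    hits = 0  # number of nodes in the size-k prefix of A that are also in R
    curve = []
    for k, node in enumerate(answers, start=1):
        if node in relevant:
            hits += 1
        curve.append((hits / k, hits / len(relevant)))
    return curve
```

For instance, with A = [a, b, c, d] and R = {a, c}, the first prefix has Precision 1 and Recall 1/2, while the full list has Precision 1/2 and Recall 1.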

^http://academic.research.microsoft.com/RankList?entitytype=8&topDomainID=15&subDomainID=0 

The elements of A are simply those nodes whose texts contain the keyword 
in question. As for R, normally it would be identified by a group of experts. In 
the absence of one, however, we identify it by resorting to all ten local features, 
not just the feature that is being analyzed and was used to sort A, and letting 
each one "vote" for or against each potential candidate for inclusion in R. Set 
R, therefore, is as much a function of the feature used to sort A as it is of the 
others. The following steps summarize the construction of set R: 

1. Let X be the set of nodes in whose texts the desired keyword appears. If 
|X| ≤ 10, go to Step 5. 

2. Create ten sorted lists of the nodes in X, each by nonincreasing order of 
one of the ten local features. 

3. Let Y be the set of nodes that appear amid the top ten nodes in a strict 
majority (i.e., at least six) of the ten lists. 

4. Let R := Y and stop. 

5. Let R := ∅ and stop. 

Note that requiring |X| > 10 for termination to occur in Step 4 is necessary 
to avoid the trivial case of R = X, which allows for no discrimination of the 
local features vis-à-vis one another. When the requirement is not met and 
termination occurs in Step 5, the query in question is dropped. 
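
The voting procedure above can be sketched as follows. The sketch rests on assumptions of ours: the function name `build_relevant_set` and its parameters are hypothetical, each local feature is represented as a dictionary from node to value, ties in feature value are broken by the input order of X (the steps above do not specify a tie-breaking rule), and a dropped query is signaled by returning `None`.

```python
def build_relevant_set(X, features, top=10):
    """Build the set R of relevant nodes by majority vote of the local features.

    X        -- nodes whose texts contain the keyword
    features -- list of dicts, each mapping a node to one local-feature value
    Returns R, or None when the query is dropped (|X| <= top).
    """
    X = list(X)
    if len(X) <= top:
        return None  # Step 5: the query is dropped
    votes_needed = len(features) // 2 + 1  # strict majority (six of ten)
    votes = {v: 0 for v in X}
    for f in features:
        # Step 2: sort X by nonincreasing value of this feature.
        ranked = sorted(X, key=lambda v: f[v], reverse=True)
        # Step 3: each feature votes for its top-ranked nodes.
        for v in ranked[:top]:
            votes[v] += 1
    # Step 4: R is the set of nodes backed by a strict majority.
    return {v for v in X if votes[v] >= votes_needed}
```

Note that R may come out empty even when the query is not dropped, which is why the sketch uses `None`, rather than an empty set, to mark a dropped query.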

Our results, given next, refer to those MAS keywords, out of the top 300, for 
which the procedure above terminated in Step 4 in our experiments. Whenever 
such keywords numbered more than 100, we considered only the top 100. As 
it turns out, we obtained the desired 100 keywords for all graphs but D, which 
ended up with only 14 keywords (i.e., only 14 of the 300 keywords were found 
in more than ten nodes). Figure 5 contains the resulting Precision-Recall plots. 
They are given as Precision averages relative to eleven Recall intervals, viz. 
[0, 0.1), [0.1, 0.2), ..., [0.9, 1), [1, 1], plotted respectively at the abscissae 0, 0.1, ..., 1. 
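
To make the binning concrete, the sketch below averages the Precision values of a single query over the eleven Recall intervals. It is an illustration under our own assumptions: the function name `binned_precision` is hypothetical, R is assumed nonempty, empty intervals are reported as `None`, and nothing is assumed here about how values are combined across queries.

```python
def binned_precision(answers, relevant):
    """Average Precision within each of the eleven Recall intervals.

    Index b < 10 covers Recall in [b/10, (b+1)/10); index 10 covers Recall = 1.
    Returns a list of eleven averages, with None for intervals never visited.
    """
    relevant = set(relevant)
    if not relevant:
        raise ValueError("R must be nonempty")
    sums = [0.0] * 11
    counts = [0] * 11
    hits = 0
    for k, node in enumerate(answers, start=1):
        if node in relevant:
            hits += 1
        # Integer arithmetic avoids floating-point trouble at interval edges.
        b = 10 * hits // len(relevant)
        sums[b] += hits / k
        counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

With A = [a, b, c, d] and R = {a, c}, the first two prefixes fall in the interval [0.5, 0.6) and the last two in [1, 1]; all other intervals are empty.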

According to the data in Figure 5, in order to search the mathematical 
portion of Wikipedia through the use of local features based on graph W, it 
is best to use page rank, followed very closely by either the hub or authority 
feature. Should the search be based on graph W', however, then the hub 
criterion is the clear winner. Notice, notwithstanding this, that 
the use of W' incurs a loss of Precision of about 10% relative to the use of W 
and cannot be recommended on any grounds. Still regarding Wikipedia, our 
data also indicate that, if one is willing to examine the list A of answer nodes 
past the point at which about 70% of the set R of relevant nodes have been 
covered, then the nodes' degrees and stress centralities turn out to be the local 
features to be preferred for sorting A. 

Turning to MathWorld we find a wholly different picture in the data, since 
now the best local feature to sort A is the nodes' stress-centrality values as 
given by graph M, regardless of how much of R one is willing to examine. 
If one is willing to examine no more than about 50% of R, though, then the 
nodes' betweenness-centrality values are equally effective. As for using graph 
M', and unlike the case of Wikipedia, only a small loss of Precision is incurred 
in comparison with M (about 1 or 2%), but now the local feature of preference 
to sort A is betweenness centrality, followed very closely by the nodes' degrees 
(for examining up to about 60% of R). 

Figure 5: Continued. 

As for DLMF, the local feature of choice is once again betweenness centrality 
(for up to about 40% of R), though the nodes' authority values are equally 
effective (up to about 20% of R), and so are the nodes' stress-centrality values 
and in-degrees (up to about 10% of R). Should one be willing to examine about 
50% of set R or more, then stress centrality becomes the local feature to be 
preferred. 

7 Conclusions 

We have studied three online mathematical libraries, viz. the mathematical 
portion of Wikipedia, MathWorld, and DLMF, from the perspective of network 
theory. To this end, we considered directed graphs whose nodes are library 
pages and whose edges reflect the directed pairing of pages through the links 
that point from one to another. In the case of Wikipedia and MathWorld, these 
links come in two clearly identifiable categories (those that are in-text and those 
in the pages' See-also sections), so we considered two separate graphs for each of 
these libraries. We focused on both global and local network-theoretic properties 
of these graphs, aiming at characterizing them, studying their resiliency to the 
accidental or intentional loss of material, and also assessing how best to perform 
text search in the pages that their nodes stand for. 

Among our key findings are the presence of GSCCs that in most cases encompass 
node fractions substantially larger than that of the Web, indications 
of small-world phenomena, practically no signs of relevant assortativity in the 
linking patterns, and the absence of any clear power laws describing the distri- 
butions of local features. We also found that most graphs are quite resilient to 
the accidental loss of material, though naturally less so when we consider the 
intentional destruction of pages. As for searching the libraries for the occur- 
rence of specific keywords, only for Wikipedia do the customary criteria of page 
rank and the HITS-related features perform best. For the smaller MathWorld 
and DLMF, primacy is taken by local features that hitherto do not appear to 
have been considered for this purpose, notably stress centrality, betweenness 
centrality, and the nodes' degrees. 

We believe that many of these findings can be attributed to one key distinguishing 
property of all three libraries. Unlike what happens in several other domains, 
where such intangibles as affinity or popularity dictate the establishment of con- 
nections, in building these libraries what matters is how knowledgeable each 
contributor is on the core material being treated and on how it relates to the 
other topics. That this key distinction should surface in the form of measurable 
effects such as the networks' structural properties and their consequences, and 
that this should happen despite the typically large number of often independent 
contributors involved, is quite remarkable. 

We conclude with a note on some related work on MathWorld that precedes 
our own analysis [29]. That work is based on a December 2008 version of the 
library, so it predates the one we use by some eight months (cf. Table 1). Despite 
this relatively short span of intervening time, our graph has 25% more nodes 
(about 3 000 nodes beyond that work's 12 000), so we conjecture that some 
intermittent failure during the download process may have caused the loss of 
material. In [53] the authors give the distributions of in- and out-degrees and 
of betweenness centrality. Despite the considerable difference between the two 
graphs, our results agree with theirs in that neither in-degrees nor out-degrees 
are distributed as power laws. Their betweenness-centrality distribution also 
appears consistent with ours, though they seem to have missed the page for 
"Triangle," which we find to be one of the top ten for this local feature but they 
do not. They also discuss clustering, average distance, and assortativity, but 
the definitions they use for these quantities are not the most commonly used 
and are incompatible with ours. 